Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.
Gradient Descent (GD) is an optimization algorithm used to minimize a cost (loss) function in machine learning and deep learning. It iteratively adjusts the model's parameters (weights) to reduce the error between predicted and actual outcomes, moving toward the minimum of the loss function by following the negative gradient. Since the cost function measures how well a model fits the data, minimizing it improves the fit.
Gradient descent is a way to minimize an objective function \(J(\theta)\), parameterized by a model’s parameters \((\theta \in \mathbb{R}^d )\), by updating the parameters in the opposite direction of the gradient of the objective function, \(\nabla J(\theta)\), with respect to the parameters \(\theta\).
The learning rate, \(\eta\), determines the size of the steps we take to reach a (local) minimum.
Why Gradient Descent? It works well for optimizing high-dimensional problems, such as neural networks.
#1 Steps Involved in Gradient Descent
1. Initialize the Parameters
Begin by initializing the model parameters \(\theta\) with some values (usually small random numbers).
These parameters will be updated iteratively to minimize the objective function.
2. Calculate the Gradient
Compute the gradient of the objective function \(J(\theta)\) with respect to the parameters \(\theta\). This is the derivative \(\nabla J(\theta)\), which indicates the direction of the steepest ascent.
3. Update the Parameters
Update the parameters \(\theta\) in the direction opposite to the gradient to minimize the objective function: \[
\theta_{new} = \theta_{old} - \eta \cdot \nabla J(\theta_{old})
\] where \(\eta\) is the learning rate, controlling the step size.
4. Repeat Until Convergence
Continue iterating through steps 2 and 3 until the algorithm converges, i.e., the changes in the parameters become very small or the objective function reaches a desired minimum, e.g., \[|\theta_{new} - \theta_{old}| < \epsilon\] for some small tolerance \(\epsilon\) (such as \(10^{-3}\)).
5. (Optional) Check for Convergence Criteria
Convergence can be determined by one or more of the following:
The gradient \(\nabla J(\theta)\) becomes sufficiently close to zero.
The change in the value of \(J(\theta)\) between iterations falls below a predefined threshold.
A fixed number of iterations is reached.
6. Return the Final Parameters
Once the parameters have converged, return the final optimized values of \(\theta\).
Gradient Descent Algorithm
Gradient Descent Pseudo-code
Initialize parameters \(\theta\) (e.g., \(\theta_0, \theta_1, ..., \theta_n\)) randomly or with zeros
Choose learning rate \(\eta\) (a small positive number)
Set the number of iterations (max_iters) or define a convergence criterion
Repeat until convergence or for max_iters times:
Calculate the gradient of the cost function \(J(\theta)\) with respect to each parameter \(\theta\)
For each parameter \(\theta_i\), compute the partial derivative \(\text{gradient}_i = \frac{\partial}{\partial \theta_i}J(\theta)\); stacking these gives the gradient vector \[\nabla J(\theta) = \left[\frac{\partial}{\partial \theta_1}J(\theta), \frac{\partial}{\partial \theta_2}J(\theta), \dots , \frac{\partial}{\partial \theta_n}J(\theta)\right]\]
Update the parameters \(\theta\) using the gradient and learning rate \(\eta\):
For each parameter \(\theta_i\): \[\theta_i = \theta_i - \eta \cdot \text{gradient}_i\]
Optionally: Check if the change in the cost function or parameters is small enough (convergence criterion)
Return the optimized parameters \(\theta\)
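The pseudo-code above can be made concrete. Here is a minimal runnable sketch (the function, starting point, and learning rate are illustrative choices, not from the text) that minimizes \(J(\theta) = (\theta - 3)^2\), whose gradient is \(2(\theta - 3)\):

```python
# Minimal gradient descent on J(theta) = (theta - 3)^2.
# Its gradient is 2 * (theta - 3); the minimum sits at theta = 3.

def gradient(theta):
    return 2 * (theta - 3)

theta = 0.0   # step 1: initialize the parameter
eta = 0.1     # learning rate

for _ in range(1000):
    grad = gradient(theta)      # step 2: calculate the gradient
    if abs(grad) < 1e-8:        # step 5: convergence check (gradient near zero)
        break
    theta = theta - eta * grad  # step 3: update against the gradient

print(theta)  # step 6: final parameter, very close to 3.0
```

With this convex objective the loop exits on the gradient-based convergence criterion after roughly 90 iterations, well before the iteration cap.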
Taking an example
For example, if the objective function is \(J(\theta)\) with \(\theta = (m, c)\) (the slope and intercept of a fitted line), gradient descent updates \(m\) and \(c\) simultaneously.
The model is said to have converged when the change between iterations is negligible, e.g., \(|m^{(i)} - m^{(i-1)}| \le 0.01\).
Finding Minimum from any point
#2 Learning Rate η
The learning rate is a hyperparameter in machine learning that controls the step size at which the parameters \(\theta\) of an objective function \(J(\theta)\) are updated during training. Choosing it well matters, because both extremes waste time and resources:

- Too large: the algorithm might overshoot the minimum, causing oscillation or divergence.
- Too small: convergence will be slow, and the optimizer might get stuck in shallow local minima.
A desirable learning rate is one that’s low enough so that the model/network converges to something useful but high enough so that it can be trained in a reasonable amount of time.
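As a quick numerical sketch of these trade-offs (the function and rates are illustrative choices), compare three learning rates on \(J(\theta) = \theta^2\):

```python
# Effect of the learning rate on gradient descent for J(theta) = theta^2,
# whose gradient is 2 * theta. Each step multiplies theta by (1 - 2*eta).

def run_gd(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - eta * (2 * theta)
    return theta

print(run_gd(0.01))  # too small: after 20 steps theta is still far from 0
print(run_gd(0.4))   # reasonable: theta shrinks rapidly toward 0
print(run_gd(1.1))   # too large: |theta| grows every step -> divergence
```

The same budget of 20 steps gives wildly different outcomes purely because of \(\eta\), which is why it is usually the first hyperparameter to tune.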
Learning Rate effect
#3 Some Concepts
Before going deeper into gradient descent, we need a few supporting concepts:
1. Loss function
This represents the error for a single training example. It measures how far off a model’s prediction is from the actual value for a single data point. For example, in regression, the loss for a single example could be the squared error: \[\text{Loss} = (y_i - \hat{y_i})^2\]
2. Cost function
The cost function is typically the average of the loss over the entire training set, giving an overall measure of model error across all data points. For example, with Mean Squared Error (MSE), the cost function over \(m\) samples is: \[J(W) = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y_i})^2\]
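To make the loss/cost distinction concrete, here is a small sketch with made-up numbers: the loss is the error of one example, the cost averages the losses over all examples.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # actual values (made-up)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # model predictions (made-up)

# Loss: the squared error of a SINGLE training example (here, the first)
loss_0 = (y_true[0] - y_pred[0]) ** 2
print(loss_0)  # 0.25

# Cost: the AVERAGE of the per-example losses over all m samples (MSE)
cost = np.mean((y_true - y_pred) ** 2)
print(cost)  # 0.375
```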
3. Convex function
A convex function is a type of function in mathematics and optimization that has a shape where any line segment drawn between two points on the function’s curve lies above or on the curve itself. Convex functions have specific properties that make them particularly important in optimization because they are easier to work with, especially when it comes to finding global minima.
For a convex function:
Upward Curvature: The function curves upward or is a straight line.
Single Global Minimum: There is either a unique global minimum (the lowest point on the curve) or the function is flat (in which case every point is a minimum).
```python
import matplotlib.pyplot as plt
import numpy as np

ipt = np.linspace(-10, 10, 100)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(5, 3))
plt.plot(ipt**2, label="f(x)")
plt.plot(0.5*ipt + 70, color='black', linewidth=1)
plt.title("Convex function", color='brown')
plt.axis('off')
plt.legend()
plt.show()
```
4. Non-Convex function
A non-convex function has multiple peaks and valleys, resulting in multiple local minima and local maxima. This makes finding the global minimum (the lowest point) challenging, because optimization algorithms like Gradient Descent can easily get stuck in one of these local minima rather than finding the lowest possible value.
The loss functions used in deep learning (e.g., mean squared error or cross-entropy) are generally non-convex due to the large number of parameters and the complex network structure, leading to highly irregular optimization landscapes.
For a non-convex function:
Multiple Local Minima and Maxima: non-convex functions often have several local minima and maxima.
Complex landscape: Non-convex functions often have irregular, oscillating shapes. This creates a “rough” landscape with many ups and downs.
Curved in Different Directions: Non-convex functions may have different curvatures in different directions, which can create valleys, peaks, and saddle points.
Examples: \(f(x) = x^4 - x^3\), \(f(x) = \sin(x)\), etc.
Code
```python
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('fivethirtyeight')
ipt = np.linspace(0, 600, 100) * np.pi / 180
plt.figure(figsize=(5, 3))
plt.plot(np.sin(ipt) + 0.08*ipt, label='f(x)')
plt.plot(0.05*ipt + 0.44, color='black', linewidth=1)
plt.axis('off')
plt.title("Non Convex function", color='brown')
plt.legend(loc='lower left')
plt.show()
```
| Feature | Convex Function | Non-Convex Function |
|---|---|---|
| Shape | Curves upward, or is flat/linear | Complex, with multiple peaks and valleys |
| Global Minimum | Has a unique global minimum or is flat | Can have multiple local minima in addition to the global minimum |
| Local Minima | No local minima other than the global minimum | May have multiple local minima, leading to optimization challenges |
| Optimization | Easier to optimize; Gradient Descent is guaranteed to find the global minimum (with a suitable learning rate) | Harder to optimize; Gradient Descent can get stuck in local minima |
| Examples | Quadratic (\(f(x) = x^2\)), exponential (\(f(x) = e^x\)), linear | \(f(x) = x^4 - x^3\), \(f(x) = \sin(x)\), neural network loss functions |
| Gradient Descent | Converges to the global minimum reliably | May converge to a local minimum or saddle point; not guaranteed to find the global minimum |
#4 Optimization Challenges in Deep Learning
Optimization in the case of convex functions is a relatively straightforward task. In machine learning, for example, the loss function for linear regression is the mean squared error (MSE), which is always convex. This means that there is only a single optimal minimum which is the global minimum in the entire MSE function, and we do not need to worry about the existence of local optima.
However, in the case of neural networks, we have many parameters (weights and biases) to train in order to get the optimal solution and minimize the loss. This is because neural networks are composed of many non-linear functions that are stacked (layer after layer) on top of each other. The composition of non-linear functions can produce a non-convex loss function. Non-convex functions have many local minima, out of which only one is the global minimum. Our goal is to reach the global minimum by avoiding these local minima.
Local Minima, Global Minima and Saddle Point
1. The Local Minima
Non-convex functions often have multiple local minima, where the loss function reaches a low point but not the global minimum.
Gradient Descent and its variants can get stuck in these local minima, causing the network to converge to suboptimal solutions. This means the model may not perform as well as it could if it found the global minimum.
Techniques like Stochastic Gradient Descent (SGD), momentum, and adaptive optimizers (e.g., Adam) help by introducing some variability in the optimization path, making it possible to escape shallow local minima.
Local vs Global minima
2. The Learning Rate and Initial Guess Problem
Choosing the right hyperparameters is crucial for optimization in neural networks, and the learning rate is one of the most important. The learning rate controls the step size in each update during gradient descent. If we set it too low, the optimization process will be slow, requiring many updates to approach the minimum. Conversely, if we set the learning rate too high, the updates may overshoot the minimum, leading to oscillations or even divergence, preventing convergence.
The learning rate also plays a role in avoiding local minima. In a non-convex loss landscape, local minima are points where the loss is lower than the nearby region but not the lowest possible value (the global minimum). Local minima typically have low gradients around them, making it easy for the optimizer to settle there. By setting a larger learning rate, the optimizer can “jump” over these shallow valleys, avoiding some local minima and continuing its search for a better solution. However, a high learning rate can also cause the optimizer to miss deeper minima (like the global minimum) by overshooting them.
Code
```python
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('default')

# Define a non-convex function with a local and a global minimum
def f(x):
    return 3*x**4 - 5*x**2 + x

def g(x):  # derivative of f
    return 12*x**3 - 10*x + 1

# Parameters
learning_rates = [0.07, 0.05]  # Different learning rates to demonstrate overshooting
initial_position = 1.4         # Start point near the local minimum
iterations = 40                # Number of iterations for the gradient descent
max_position_value = 10        # Maximum value for position clipping

# Prepare the plot
fig, axes = plt.subplots(2, 1, figsize=(8, 10))

# Loop through each learning rate and plot in a separate subplot
for i, learning_rate in enumerate(learning_rates):
    # Initialize the ball position
    position = initial_position
    trace_x = [position]
    trace_y = [f(position)]

    # Perform gradient descent with the current learning rate
    for _ in range(iterations):
        gradient = g(position)
        position = position - learning_rate * gradient
        # Clip the position to avoid it growing too large (overflow safeguard)
        if abs(position) > max_position_value:
            position = max_position_value * np.sign(position)
        trace_x.append(position)
        trace_y.append(f(position))

    # Plot the function
    x = np.linspace(-1.5, 1.5, 400)
    y = f(x)
    ax = axes[i]
    ax.plot(x, y, label=r'$f(x) = 3x^4 - 5x^2 + x$', color='b')

    # Mark the local and global minima
    local_min_x = 0.85
    global_min_x = -0.95
    ax.scatter([global_min_x, local_min_x],
               [f(global_min_x), f(local_min_x)], color='red', zorder=5)
    ax.text(global_min_x, f(global_min_x) - 0.5, 'Global Minimum',
            color='red', fontsize=12, ha='center')
    ax.text(local_min_x, f(local_min_x) - 0.5, 'Local Minimum',
            color='red', fontsize=12, ha='center')

    # Plot the ball trace
    ax.plot(trace_x, trace_y, 'go-',
            label='Ball Trace (Overshooting)' if learning_rate == 0.07
                  else 'Ball Trace (No Overshooting)')

    # Title and labels
    ax.set_title(f'Demonstration with Learning Rate = {learning_rate}')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.axhline(0, color='black', linewidth=1)
    ax.axvline(0, color='black', linewidth=1)
    ax.set_ylim(-4, 4)
    ax.legend()
    ax.grid(True)

plt.suptitle("Comparison of Gradient Descent with Different Learning Rates", fontsize=16)
plt.tight_layout()
plt.show()
```
3. Saddle Points and Plateaus
In high-dimensional spaces (such as those in deep neural networks), saddle points (also called minimax points) are very common. A saddle point is a point on the surface of a function where the slope is zero (as at a minimum or maximum), but which is neither a true minimum nor a true maximum.
Imagine you’re standing on a mountain pass: a point between two peaks that is the lowest point along a ridge, but not the lowest point in all directions. In one direction, the ground slopes up toward the peaks on either side (so it’s a low point in that direction). But if you look the other way, it slopes down into valleys (making it a high point in that direction).
In the context of optimization (like gradient descent in machine learning), a saddle point is tricky because the slope is zero, so it may “fool” the algorithm into thinking it has reached a minimum or maximum. However, it’s just a flat spot that doesn’t represent the lowest or highest possible point overall.
It is called a minimax point because, along one direction, it is a local minimum, while along another direction it is a local maximum.
The name saddle point comes from its shape, which resembles a horse's saddle.
A saddle point (in red) on the graph of \(z = x^2 − y^2\)
Saddle point between two hills
Non-convex functions can have flat regions or plateaus where the gradient is close to zero over a range of values. A plateau is a flat, level area in a landscape. Imagine a long stretch of land where there’s almost no change in elevation; it’s just flat and goes on for a while without any uphill or downhill movement. In the context of a loss landscape (the landscape of values that show how good or bad a model’s predictions are), a plateau is an area where the loss value doesn’t change much as you move across it. When a gradient descent algorithm (the “walker”) enters a plateau, it struggles to find a clear direction to go in because the “slope” is very shallow or zero, making progress slow or causing it to get “stuck” temporarily.
Plateau in region
Saddle Point: A single point with both upward and downward slopes in different directions.
Plateau: A wide, flat area with no steep slope in any direction, causing slow movement through it.
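The saddle point of \(z = x^2 - y^2\) mentioned above can be checked numerically: the gradient \((2x, -2y)\) vanishes at the origin even though the origin is neither a minimum nor a maximum.

```python
# For z = x^2 - y^2 the gradient is (2x, -2y): zero at the origin,
# yet the origin is a minimum along the x-axis and a maximum along the y-axis.

def z(x, y):
    return x**2 - y**2

def grad(x, y):
    return (2 * x, -2 * y)

gx, gy = grad(0.0, 0.0)
print((gx, gy))               # zero gradient: no descent signal here
print(z(0.5, 0.0) > z(0, 0))  # True: z rises along x (minimum direction)
print(z(0.0, 0.5) < z(0, 0))  # True: z falls along y (maximum direction)
```

Because the gradient is exactly zero at such a point, a plain gradient-descent update makes no progress there, which is precisely why saddle points can stall training.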
4. Sensitivity to Initial Parameters
Gradient Descent’s convergence can be highly dependent on the initial values of the parameters. Poor initialization can lead to slower convergence or getting stuck in poor local minima (for non-convex functions). In deep learning, poor initialization can result in dead neurons (especially in ReLU networks) and significantly slow down training. Use weight initialization techniques like Xavier or He initialization, which take into account the layer size and the type of activation function. Randomized initialization is also better than starting with zeros, as it breaks symmetry.
5. Scaling and Feature Normalization
Gradient Descent is sensitive to the scale of the features. When features have widely different scales, the algorithm might struggle with convergence. Unscaled features can lead to erratic updates, slow convergence, and suboptimal results. Use feature scaling (such as standardization or normalization) to ensure that features are on a similar scale.
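As an illustrative sketch (values made up, loosely echoing an age/salary pair), standardization brings features with very different scales onto comparable ranges; this is the same transform scikit-learn's StandardScaler applies.

```python
import numpy as np

# Two features on wildly different scales, e.g. an age and a salary (made-up)
X = np.array([[19.0, 19000.0],
              [35.0, 20000.0],
              [26.0, 43000.0],
              [27.0, 57000.0]])

# Standardization: per column, subtract the mean and divide by the std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]: each feature now has zero mean
print(X_scaled.std(axis=0))   # [1, 1]: and unit variance
```

After scaling, a single learning rate works reasonably for both parameters, instead of one feature's huge raw values dominating the gradient updates.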
#5 Types of Gradient Descent
There are three variants of Gradient Descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of parameter update and the time it takes to perform an update.
1. Batch Gradient Descent (Vanilla)
Batch Gradient Descent is an optimization algorithm used to minimize the cost function of a model by updating its parameters in the direction that reduces the overall error.
The gradient is calculated using the entire dataset at each iteration. The term “batch” indicates that we use the full batch of data points in each update. This makes the updates more stable but can be computationally expensive if the dataset is very large.
Consider this Pseudo code:
```
Set learning rate η
Set number of epochs

for epoch in range(epochs):
    # Step 1: Forward pass, i.e., predict y_hat for all data
    y_hat = np.dot(X, W)

    # Step 2: Calculate cost function (e.g., Mean Squared Error)
    L = (1 / m) * sum((y - y_hat) ** 2)   # m : no. of samples

    # Step 3: Calculate gradient
    gradient = dL/dW

    # Step 4: Update weights
    W = W - η * gradient

    # Optionally print or store the loss for monitoring
    print("Epoch:", epoch, "Loss:", L)

Return final weights W
```
Explanation
1. Initialize weights: Start with random weights \(W\).
2. Learning rate and epochs: Define the learning rate \(\eta\) and the number of epochs.
3. Loop through epochs:
  3.1 Calculate predictions: Compute predictions \(\hat{y}\) for each data point \(x_i\) using the dot product: y_hat = np.dot(X, W)
  3.2 Compute cost function: Evaluate the cost function \(J(W)\). This measures the model's error with the current weights over the entire dataset.
  3.3 Calculate gradient \(\boldsymbol{\nabla_W J(W)}\): The gradient is the derivative of the cost function w.r.t. each weight, \(\frac{\partial J(W)}{\partial w}\), which we use to update the weights.
  3.4 Update weights: Move the weights \(W\) in the opposite direction of the gradient: \[{W = W - \eta \cdot \nabla_{W} J(W)}\]
4. Repeat steps 3.1 - 3.4 for the specified number of epochs or until the cost function converges (i.e., stops decreasing significantly with further updates).
Number of weight updates = number of epochs.
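The pseudo-code above can be turned into a short runnable sketch (synthetic data and illustrative hyperparameters, not from the text) fitting \(y = 2x\) with Batch GD; the MSE gradient used is \(\nabla_W J = -\frac{2}{m}X^T(y - \hat{y})\):

```python
import numpy as np

# Synthetic data: y = 2x (no noise), one feature, m = 50 samples
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * X[:, 0]

W = np.zeros(1)   # initialize weights
eta = 0.5         # learning rate
m = X.shape[0]

for epoch in range(200):                  # one weight update per epoch
    y_hat = X @ W                         # Step 1: forward pass on the FULL dataset
    cost = np.mean((y - y_hat) ** 2)      # Step 2: cost J(W)
    grad = -(2 / m) * X.T @ (y - y_hat)   # Step 3: gradient of the cost w.r.t. W
    W = W - eta * grad                    # Step 4: update weights

print(W)  # close to [2.0]
```

Because the whole dataset enters every gradient, the path toward the minimum is smooth and deterministic, matching the contour plot described above.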
Gradient Descent Convergence
Contour plot of a convex function, with Batch GD reaching the global minimum smoothly
Advantages
Steady and Accurate Convergence to a Minimum: Batch Gradient Descent will normally yield a direct path to a minimum. This means little time will be spent moving in incorrect directions in weight space if the learning rate \(\eta\) is not too big. This also means Batch Gradient Descent will yield a very accurate estimate of the minimum as shown above.
Easy Implementation: The Batch Gradient Descent algorithm is quite simple to implement in code, when compared to other variations. The number of hyperparameters introduced is kept to a minimum, and the implementation can be fully vectorised.
Excellent for Convex Functions: Batch Gradient Descent works very well for convex functions with smaller-sized datasets. In such a scenario, all derivatives point to the single global minimum, and there will be no issue loading all the training data into memory for processing.
Limitations
Impractical for Larger Datasets: As all the training data is used to compute each step of the algorithm, this approach is very computationally expensive. All the training data must be loaded into memory. For Big Data, this simply isn't feasible.
Slow to Converge: Related to the first point; since all the training data is used to compute each step, this puts strain on the processing resources we have available. The algorithm may only slowly converge towards a minimum for larger datasets.
Difficult to Escape Local Minima: For non-convex loss functions, many local minima will be present in weight space. Batch GD tends to lead towards a direct path to the nearest minimum. As such, this algorithm can get stuck at a local minimum in these situations.
Not Suitable for Online or Real-Time Learning: Batch Gradient Descent requires the entire dataset for each update, which makes it unsuitable for online learning (where data is received continuously over time). In applications like recommendation systems, stock market prediction, or fraud detection, where data arrives sequentially, Batch Gradient Descent would need to retrain the model repeatedly on the entire dataset, which is inefficient.
2. Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a variant of the classic Gradient Descent (GD) used to train machine learning models, including neural networks. While Batch Gradient Descent computes the gradient using the entire dataset, SGD computes the gradient based on a single randomly chosen data point.
Consider this Pseudo code:
```
Set learning rate η
Set number of epochs
loss_at_each_epoch = []

for epoch in range(epochs):
    # shuffle the data
    epoch_loss = 0

    for j in range(X.shape[0]):
        # Step 1: predict y_hat for the single data point x[j]
        y_hat[j] = W.T * x[j] + b

        # Step 2: Calculate loss (e.g., squared error)
        L = (y[j] - y_hat[j]) ** 2
        epoch_loss = epoch_loss + L

        # Step 3: Calculate gradients
        gradient_W = dL/dW
        gradient_b = dL/db

        # Step 4: Update weights and bias
        W = W - η * gradient_W
        b = b - η * gradient_b

    # calculate the average loss for each epoch
    average_loss = epoch_loss / X.shape[0]
    loss_at_each_epoch.append(average_loss)
```
1. Initialize parameters: The model weights and bias are initialized randomly.
2. Learning rate and epochs: Set the learning rate \(\eta\) and the number of epochs.
3. Loop through epochs: Repeat for each epoch until the specified number of epochs.
  3.1 Shuffle: Randomly shuffle the dataset at the beginning of each epoch so the training order doesn't bias the model.
  3.2 Loop through each training sample:
    3.2.1 Calculate prediction: Compute the prediction \(\hat{y_j}\) for the selected row using \[\hat{y} = W^T \cdot x + b\]
    3.2.2 Compute loss: Compute the loss \((y_j - \hat{y_j})^2\) for that prediction.
    3.2.3 Calculate gradient \(\boldsymbol{\nabla_W J(W)}\): Compute the gradient \(\frac{\partial J(W)}{\partial w}\) of the loss with respect to the model's parameters (weights and bias) for that single sample.
    3.2.4 Update weights: Update the weights \(W\) using the gradient: \[{W = W - \eta \cdot \nabla_{W} J(W)}\]
4. Track the loss: Record the loss for each sample and compute the average loss at the end of the epoch; this shows how the model is improving. Repeat step 3 for the specified number of epochs or until convergence (when the average loss no longer decreases significantly over consecutive epochs).
Number of weight updates = (number of epochs) x (number of training samples)
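A matching runnable SGD sketch on the same kind of synthetic data (illustrative hyperparameters) makes the update count explicit: one weight update per sample, per epoch.

```python
import numpy as np

# Synthetic data: y = 2x, m = 50 samples
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * X[:, 0]

W = np.zeros(1)
eta = 0.1
epochs = 20
updates = 0

for epoch in range(epochs):
    order = rng.permutation(len(X))        # shuffle each epoch
    for j in order:
        y_hat = X[j] @ W                   # prediction for ONE sample
        grad = -2 * (y[j] - y_hat) * X[j]  # gradient of (y_j - y_hat_j)^2
        W = W - eta * grad                 # one update per sample
        updates += 1

print(updates)  # 20 epochs x 50 samples = 1000 updates
print(W)        # noisy path, but close to [2.0]
```

Each update uses a single sample's gradient, so the trajectory zig-zags; the total update count is epochs times samples, exactly as stated above.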
Stochastic gradient descent near a piecewise linear saddle point.
Contour plot of a convex function, with SGD reaching the global minimum with considerable noise
Code
```python
import numpy as np
import plotly.graph_objects as go

# Generate a simple dataset for a linear regression task with 2D weights
np.random.seed(42)
X_data = np.linspace(-2, 2, 100).reshape(-1, 1)
X_data = np.hstack((X_data, np.ones_like(X_data)))  # Adding a bias column
y_data = 3 * X_data[:, 0] + 2 + np.random.normal(0, 0.5, X_data.shape[0])  # Linear relationship with noise

# Define the loss function (Mean Squared Error) and gradient
def loss_function(W, X, y):
    # A sinusoidal term makes the function non-convex and introduces local minima
    preds = W[0] * X[:, 0] + W[1] * X[:, 1]
    return np.mean((preds - y)**2) + 5 * np.sin(W[0]) * np.cos(W[1])

def gradient(W, x, y):
    # Calculate the predictions and error
    preds = W[0] * x[:, 0] + W[1] * x[:, 1]
    error = preds - y
    # Compute the gradient with the additional sinusoidal component
    grad_w0 = np.mean(2 * error * x[:, 0]) + 5 * np.cos(W[0]) * np.cos(W[1])
    grad_w1 = np.mean(2 * error * x[:, 1]) - 5 * np.sin(W[0]) * np.sin(W[1])
    return np.array([grad_w0, grad_w1])

# Initialize parameters for SGD
learning_rate = 0.1
epochs = 1
W = np.array([-2.0, -2.0])  # Starting away from the minimum

# Lists to store weight and loss values for plotting
W_vals, loss_vals = [W.copy()], [loss_function(W, X_data, y_data)]

# Perform SGD
for epoch in range(epochs):
    # Shuffle data indices
    indices = np.random.permutation(len(X_data))
    X_data_shuffled = X_data[indices]
    y_data_shuffled = y_data[indices]

    # Update weights for each sample in the dataset
    for i in range(len(X_data_shuffled)):
        x_i = X_data_shuffled[i].reshape(1, -1)
        y_i = y_data_shuffled[i]

        # Compute gradient using a single data point
        grad = gradient(W, x_i, np.array([y_i]))

        # Update weights
        W -= learning_rate * grad

        # Store values for the 3D plot
        W_vals.append(W.copy())
        loss_vals.append(loss_function(W, X_data, y_data))

# Convert weight values into 3D coordinates for plotting
W_vals = np.array(W_vals)
loss_vals = np.array(loss_vals)
x_path, y_path = W_vals[:, 0], W_vals[:, 1]
z_path = loss_vals

# Create a meshgrid for the 3D surface of the loss landscape
x_range = np.linspace(-3, 6, 50)
y_range = np.linspace(-3, 6, 50)
X, Y = np.meshgrid(x_range, y_range)
Z = np.array([loss_function([x, y], X_data, y_data)
              for x, y in zip(np.ravel(X), np.ravel(Y))])
Z = Z.reshape(X.shape)

# Plotting with Plotly
fig = go.Figure()

# Add the 3D surface for the loss landscape
fig.add_trace(go.Surface(z=Z, x=X, y=Y, colorscale="Viridis", opacity=0.8))

# Add the noisy SGD path as a scatter plot
fig.add_trace(go.Scatter3d(
    x=x_path, y=y_path, z=z_path,
    mode='markers+lines',
    marker=dict(size=5, color='red'),
    line=dict(color='red', width=3),
    name="Noisy SGD Path"))

# Labels and layout adjustments
fig.update_layout(
    title="Stochastic Gradient Descent (SGD) Path on the Loss Surface",
    scene=dict(
        xaxis_title='Weight 1 (W[0])',
        yaxis_title='Weight 2 (W[1])',
        zaxis_title='Loss (MSE)',
    ))
fig.show()
```
Advantages
Faster Convergence (Per Iteration): Stochastic Gradient Descent (SGD) is computationally faster than Batch Gradient Descent (BGD) because it updates parameters based on a single data point rather than the entire dataset. This makes each step of SGD quicker to compute, especially on large datasets.
Escapes Local Minima: The noisy nature of SGD (due to random updates) helps it potentially escape local minima or saddle points and can help find a better global minimum.
Online Learning: SGD can be used for online learning, where the model continuously learns as new data becomes available. It’s well-suited for situations where the data cannot be stored in memory all at once.
Memory Efficiency: Only one (or a few) data points are needed at each iteration, so SGD is more memory efficient compared to batch GD, which requires storing the entire dataset in memory.
Better Generalization: The stochasticity can act as a form of regularization, making the model generalize better on unseen data by preventing overfitting to the training set. The randomness injects noise into each update step, which helps the algorithm explore the loss surface but prevents exact convergence: the solution oscillates around a good region rather than settling precisely at the minimum. Exact convergence is often unnecessary, since slight variations in the weights rarely affect model performance much; practitioners therefore accept approximate solutions that generalize well over exact solutions that may overfit the training data. Exact optimization is also computationally expensive and, for large datasets and models, often infeasible within reasonable time limits; gradient descent methods deliberately trade exactness for speed and scalability.
Disadvantages
Hyperparameter Sensitivity: The learning rate in SGD is crucial. A high learning rate might cause overshooting, while a low learning rate might result in slow convergence. Finding the optimal learning rate requires experimentation.
Vanishing or Exploding Gradients (in Deep Networks): In deep neural networks, the gradients can become very small (vanishing) or very large (exploding) during backpropagation, which can hinder the training process.
Slower Execution (per epoch): Total Convergence Time is often slower in SGD due to the high variance in updates. Since SGD doesn’t compute the gradient over the whole dataset in each step, each individual update has some noise, causing it to take a more erratic path toward the minimum. This high variance can lead to more steps needed overall, potentially resulting in more time to converge to a stable region around the minimum. To reduce this issue, techniques like learning rate decay and momentum can help smooth the path to convergence, but these still don’t eliminate the noisy nature of SGD entirely.
Considering a code example
Code
```python
import numpy as np
import pandas as pd
import time

df = pd.read_csv("https://raw.githubusercontent.com/shivang98/Social-Network-ads-Boost/refs/heads/master/Social_Network_Ads.csv")
print("Using a Social_network_data with Shape: ", df.shape)
df.head()
```
Using a Social_network_data with Shape: (400, 5)
|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |
```python
## Considering only [Age, EstimatedSalary] as features
## and [Purchased] as target.
df = df[['Age', 'EstimatedSalary', 'Purchased']]
x = df.iloc[:, 0:2]
y = df.iloc[:, -1]
print("Input features.. ")
x.head()
```
Input features..
|   | Age | EstimatedSalary |
|---|---|---|
| 0 | 19 | 19000 |
| 1 | 35 | 20000 |
| 2 | 26 | 43000 |
| 3 | 27 | 57000 |
| 4 | 19 | 76000 |
```python
## Scaling the features before feeding into the neural network
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
```
```python
## Training the network using Batch GD.
## For Batch GD, we set batch_size = x_train.shape[0]
## so that all data is used per epoch to calculate the loss.
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# The original text uses x_train/y_train without showing the split;
# an 80/20 split (320 training rows, matching batch_size=320) is assumed.
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.2, random_state=42)

model = Sequential()
model.add(Dense(10, activation='relu', input_dim=2))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', metrics=['accuracy'])

start = time.time()
history = model.fit(x_train, y_train, epochs=10, batch_size=320)
print("\nTime taken in Batch GD: ", time.time() - start)
```
```python
## Training the network using Stochastic GD.
## For Stochastic GD, we set batch_size = 1
## to update the weights based on each sample.
model2 = Sequential()
model2.add(Dense(10, activation='relu', input_dim=2))
model2.add(Dense(10, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', metrics=['accuracy'])

start = time.time()
history2 = model2.fit(x_train, y_train, epochs=10, batch_size=1)
print("\nTime taken in SGD: ", time.time() - start)
```
Comparing the time and accuracy of Batch GD and SGD, we can say:
Batch GD is faster per epoch because the weights are updated only once per epoch; but its accuracy is 69%, indicating that more epochs are needed.
SGD is slower per epoch because the weights are updated 320 times per epoch; but its accuracy is 89%, indicating that convergence has been reached.
```python
# plotting the loss graph for SGD with 500 epochs
model2 = Sequential()
model2.add(Dense(10, activation='relu', input_dim=2))
model2.add(Dense(10, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', metrics=['accuracy'])
history2 = model2.fit(x_scaled, y, epochs=500, batch_size=1, validation_split=0.2)
```
Code
```python
import matplotlib.pyplot as plt

plt.plot(history2.history['loss'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('SGD loss @ 500 epochs', color='brown')
plt.show()
```
```python
# plotting the loss graph for Batch GD with 500 epochs
model = Sequential()
model.add(Dense(10, activation='relu', input_dim=2))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=500, batch_size=320)
```
Code
```python
plt.plot(history.history['loss'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Batch GD loss @ 500 epochs', color='brown')
plt.show()
```
Batch Gradient Descent produces smoother, more stable loss curves, while Stochastic Gradient Descent produces noisy, unstable ones because the weights are updated on individual data points.
Stochastic GD vs Batch GD
| Feature | Stochastic GD | Batch GD |
|---|---|---|
| Update Frequency | Updates weights after each training example | Updates weights after processing the entire training set |
| Computational Complexity | Low per iteration, but requires more iterations (epochs) | High per iteration (processes the entire dataset) |
| Convergence | Noisy; can escape local minima but may oscillate around a minimum | Smooth, but may get stuck in local minima |
| Memory Usage | Low (one data point at a time) | High (needs the entire dataset in memory) |
| Ideal Use Case | Large datasets, online learning, deep learning (non-convex) | Small to medium-sized datasets, convex problems |
| Generalization | Often better due to the noise, but yields an approximate solution rather than an exact one | May overfit if not regularized properly, but converges to a more exact solution |
| Convergence Speed | Needs more epochs, but each update is fast | Needs fewer updates, but each one is slow due to the full batch |
Why randomness and noise in SGD?
The noise and instability in Stochastic Gradient Descent (SGD) primarily arise from updating the model parameters based on individual data points instead of the entire dataset or a large batch.
Since each sample carries unique characteristics (and sometimes outliers or noise), the gradient calculated from a single data point may not accurately represent the direction to the true minimum.
This results in each update step being influenced by the particularities of the sample, leading to high variance in the updates and a more erratic, zigzagging path toward convergence.
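This variance can be observed directly. The sketch below (synthetic, noisy linear data; all values illustrative) compares the spread of per-sample gradients against the single full-batch gradient, which is just their average.

```python
import numpy as np

# Synthetic noisy linear data: y = 2x + noise
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 200)

W = np.zeros(1)  # evaluate gradients at the initial weights

# Gradient of the squared error for each individual sample
per_sample = np.array([-2 * (y[j] - X[j] @ W) * X[j, 0] for j in range(len(X))])

# The full-batch gradient is the average of the per-sample gradients
full_batch = per_sample.mean()

print(full_batch)        # one stable descent direction
print(per_sample.std())  # large spread: single-sample steps zig-zag
```

The full-batch gradient points steadily downhill, while individual samples pull in noticeably different directions; following them one at a time is what produces SGD's zigzag path.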
Why Doesn’t SGD Cause Overfitting?
Overfitting happens when a model fits the training data too closely, learning its small noise and idiosyncrasies, which makes it less able to generalize to new data. SGD mitigates this in several ways:
Noisy Updates Reduce Memorization: SGD uses a single sample per update. This randomness introduces noise, which disrupts the model’s ability to fit too precisely to the training data. In doing so, SGD prevents the model from “memorizing” specific data points, encouraging it to generalize better to unseen data.
Avoidance of Sharp Minima: The noise in SGD often prevents it from settling into sharp, narrow minima in the loss landscape, which can indicate overfitting. Sharp minima are highly tuned to training data specifics and usually perform poorly on test data. Instead, the noisy path of SGD encourages it to find flatter minima, which tend to represent solutions that generalize better across different data.
Regularization Effect Without Additional Terms: By not fully converging on a precise minimum, SGD implicitly regularizes the model, helping it to avoid overfitting without the need for additional penalty terms in the loss function.
In practice, this makes SGD especially effective in deep learning and large datasets, where generalization is key.